Method And Apparatus For Using Global Snooping To Provide Cache Coherence To 
Distributed Computer Nodes In A Single Coherent System 



Cross-Reference to Related Applications 

The following patent applications, all assigned to the assignee of this application,
describe related aspects of the arrangement and operation of multiprocessor computer systems
according to this invention or its preferred embodiment.

U.S. patent application serial number __/___,___ by T. B. Berg et al.
(BEA919990003US1) entitled "Method And Apparatus For Increasing Requestor Throughput
By Using Data Available Withholding" was filed on January __, 2002.

U.S. patent application serial number __/___,___ by T. B. Berg et al.
(BEA920000018US1) entitled "Multi-level Classification Method For Transaction Address
Conflicts For Ensuring Efficient Ordering In A Two-level Snoopy Cache Architecture" was
filed on January __, 2002.

U.S. patent application serial number __/___,___ by S. G. Lloyd et al.
(BEA920000019US1) entitled "Transaction Redirection Mechanism For
Handling Late Specification Changes And Design Errors" was filed on January __, 2002.

U.S. patent application serial number __/___,___ by T. B. Berg et al.
(BEA920000020US1) entitled "Method And Apparatus For Multi-path Data Storage And
Retrieval" was filed on January __, 2002.

U.S. patent application serial number __/___,___ by W. A. Downer et al.
(BEA920000021US1) entitled "Hardware Support For Partitioning A Multiprocessor System
To Allow Distinct Operating Systems" was filed on January __, 2002.

U.S. patent application serial number __/___,___ by T. B. Berg et al.
(BEA920000022US1) entitled "Distributed Allocation Of System Hardware Resources
For Multiprocessor Systems" was filed on January __, 2002.

U.S. patent application serial number __/___,___ by W. A. Downer et al.
(BEA920010030US1) entitled "Masterless Building Block Binding To Partitions" was filed on
January __, 2002.

U.S. patent application serial number __/___,___ by W. A. Downer et al.
(BEA920010031US1) entitled "Building Block Removal From Partitions" was filed on January
__, 2002.

U.S. patent application serial number __/___,___ by W. A. Downer et al.
(BEA920010041US1) entitled "Masterless Building Block Binding To Partitions Using
Identifiers And Indicators" was filed on January __, 2002.

Background Of The Invention

Technical Field 

The present invention relates generally to computer data cache schemes, and more
particularly to a method and apparatus for maintaining coherence within a
system having distributed shared memory when such system utilizes multiple data processors
capable of being configured into separate, independent nodes in a system utilizing non-uniform
memory access (NUMA) or system memory which is distributed across various nodes.

Description of the Related Art 

In computer system designs utilizing more than one processor operating simultaneously
in a coordinated manner, system memory which may be physically configured or associated
with one group of such processors is accessible to other processors or processor groups in
such system. Because of the demand for greater processing power within data processing
systems, and due to the desire to have relatively small microprocessors work cooperatively
sharing system components as a multi-processing system, there have been many attempts over
the last several years to solve the problems inherent in maintaining coherence between memory
devices which are accessible to more than one processing device or more than one system
node when such nodes include multiple processors which share resources and/or hardware
devices local to the node.

The coherence problem is exemplified by a system in which an interconnecting
crossbar communications channel is used to connect a plurality of memory devices, each
memory device paired with an associated processor or a local group of processors forming a
multi-processor system. A read or write data request from a processor or group of processors
acting as one node in such a multi-node system may be directed to addresses which are within
memory devices associated with the requesting processor or within a memory associated with
another of the processor groups within the system. Each processor group is also associated
with a local cache. Since each processor group has an associated cache, and each cache may
have more than one level, care must be taken to ensure the coherence of the data that is
maintained throughout the system.

One way to ensure that coherence is maintained is to track the state of each item of data in
a directory (or register) which points to each non-local cache in which the data resides. By
knowing the location of each copy of the data, each copy can either be updated, or a notation
can be made within the register to indicate that the data at one or more locations is out-of-date.
Such registers or tables require pointers to multiple nodes which are caching data. All of this
has the effect of slowing down system speed and therefore performance, because of
component latency and because the ability of certain systems to process multiple data lines
simultaneously while waiting for data state indicators from other memory subsystems local to
other system processors is not fully utilized.

In patents found in the related art, a problem observed with maintaining one table
which points to each copy of a particular item of data within each node is that it increases the
complexity and the width of directory entries within such tables, making the table relatively large
and complex.

U.S. Patent Number 6,088,769, issued to Luick et al., discloses a multiprocessor
cache coherence directed by combined local and global tables. While this reference defines a
single global snooping scheme, it relies on a single directory, and the L1 and L2 caches filter
data references so that only some of such references reach the central global control unit.
Luick does not teach global snooping of all data references by all processors in a
multiprocessor system by a single level central control device which facilitates communications
between multiple nodes in a multi-node processor system. Further, while Luick discloses
checking data references against a private directory that is local to a particular node, it does not
teach checking data references in a single global cache coherence table. Also, the Luick
reference describes handling cache coherence in a single processor generating data tags or
references and does not teach its use in a multiple processor cluster which generates data tags
in a node which contains its own memory and input/output capabilities that can function as a
stand alone processing system without the need to communicate through a central control
device which also acts as a communications channel to other independent processor nodes.

U.S. Patent Number 6,065,077, issued to Fu, teaches a method for sharing data caches
where all the memory and processor devices are separately connected to a flow control unit
which acts as a crossbar between the processor and memory elements, and communicates with
other crossbar or communication systems which themselves control their subsystem
components. Fu does not teach a system which serves to coordinate data across a
multiprocessor system, which itself utilizes multiple processors within a group or node which is
capable of acting independently of a central control device or crossbar system if necessary.
The method in Fu requires all references to memory be satisfied by transferring such data from
a memory unit through a flow control device acting as a crossbar to the requesting data
processing unit. Fu does not teach a system whereby data requested by a processor could be
satisfied totally by memory local to the particular node requesting such data, and does not teach
a method by which only data requests with coherence implications are transferred between
nodes.

Other references which disclose methods of keeping track of the data
maintained in various caches throughout a system include U.S. Patent Number 5,943,685,
issued to Arimilli et al., for a method of shared intervention of a single data provider among
shared caches. U.S. Patent Number 5,604,882, issued to Hoover et al., describes a system
and method for empty notification from peer cache
control units in a multiprocessor system. The related background art does not teach the use of
a central communications or control device which forwards results from one node to another
node in a multi-node system by simultaneously providing a read request to a first node and a
write request to the second node with the results of the read request communicated to that
second node without going through the central control or communications device which
communicates the data tagging and addressing information to the various nodes.

Accordingly, it is an object of the present invention to provide a system and method for
maintaining coherence of data stored in multiple caches within a multiprocessor system which
utilizes a central tag and address crossbar as a central communications pathway wherein the
central device is not required to maintain transitional or transient states for pending
cache-related data requests in the system. It is further an object of the present invention to provide a
system and method for maintaining coherence of data stored in multiple caches in separate
system nodes within a multiprocessor system having at least two nodes, wherein a data
requestor node is returned results of such a request from the target node storing requested data
without being processed through the central control device.

It is yet another object of the present invention to provide a system and method for
maintaining coherence of data stored in multiple caches located in separate nodes within a
multi-node multiprocessor system which utilizes a data tag and address crossbar control and
communications device wherein the central device controlling communications of tag and address
information from a first node to a second node simultaneously sends a read request to a first
node and a data write request to the second node with the results of the data read request being
communicated to such second node without transmission through the tag and address crossbar.

Summary Of The Invention 

The present invention provides a method and apparatus for use in computer systems
utilizing distributed computational nodes where each node consists of one or more
microprocessors and each node is capable of operating independently with local system
memory and control systems, where all the nodes are interconnected to allow operation as a
multi-node system. The method provides for maintaining cache coherence in multiprocessor
systems which have a plurality of nodes coupled by an interconnecting communications
pathway such as a tag and address crossbar system and a data crossbar system. The method
operates with a tag and address crossbar system which is capable of storing information
regarding the location and state of data within the system when the system includes the
capability to access data from the memory system of any node. The method includes the steps
of storing information regarding the state of data in said interconnecting pathway; checking said
stored information to determine the location of the most current copy of a requested portion of
data, in response to a request by a requesting node for the requested portion of data; retrieving
said current copy of requested portion of data and directing said data to the requesting node;
checking said stored information to determine the location of the requested data; and
directing the system to send said requested data to the requesting node without going through
the said interconnecting communications pathway.

The apparatus includes a multiprocessor system comprised of two or more nodes of at
least one processor each, each node including part of a shared, distributed memory system
used by the processors. The nodes are interconnected by a communications pathway which
includes means to store the location and state of data stored across the system in the distributed
memory. The preferred embodiment reduces latency in data flow throughout the system by
storing the location and state of requested data or other location and state information in a tag
and address crossbar device which examines cache line states for each line in all nodes
simultaneously. Appropriate replies back to a node requesting data or other requests are then
issued based on such information stored in the tag and address crossbar system which is acting
as the communications pathway between the nodes.

Other features and advantages of this invention will become apparent from the following
detailed description of the presently preferred embodiment of the invention, taken in
conjunction with the accompanying drawings.

Brief Description Of The Drawings

Fig. 1 is a block diagram of a typical multiprocessor system utilizing a tag and address
crossbar system in conjunction with a data crossbar system which operates with the present
invention and is suggested for printing on the first page of the issued patent.

Fig. 2A-2C is a block diagram of the tag and address crossbar system connecting each
quadrant or node in a multiprocessor system in which the invention is used.

Fig. 3 is a block diagram of one quad processor group illustrating functional
components of one group and the relationship of cache and remote cache memory in the
present invention.

Fig. 4A-4D is a table illustrating the various states of cached reads and read-invalidates
used in the preferred embodiment.

Fig. 5A-5B is a table illustrating uncached read requests in the system used in the
preferred embodiment.

Fig. 6A-6B is a table illustrating uncached write requests in the system used in the
preferred embodiment.

Fig. 7 is a table illustrating reads and writes to memory mapped input/output, CSRs,
and non-memory targets in the system used in the preferred embodiment.

Fig. 8A-8B is a table illustrating rollout requests in the system used in the preferred
embodiment.

Fig. 9A-9C is a table of the mnemonics for the fields used for all the input and output
buses for the tag and address crossbar apparatus as used in the preferred embodiment and
provides the references used in Figures 4, 5, 6, 7 and 8.

Fig. 10 is a block diagram of the mapping of the remote cache tags.

Detailed Description Of The Preferred Embodiment

Overview and Technical Background 

The present invention relates specifically to an improved data handling method for use
in a multiple processor system, configured in two or more independent processing nodes, which
utilizes a tag and address crossbar system for use in combination with a data crossbar system,
together comprising a data processing system to process multiple data read and write requests
across the nodes. In such systems, a particular processor or node cannot know the cache line
states of data that exist on other nodes, yet it needs to access the latest copy of such data within
the memory of other nodes for the system to operate coherently. In such multi-node systems, a
method for system-wide cache coherency must be adopted to assure that a data cache line is
current for a microprocessor or node that requests a given data cache line. The system
disclosed provides a hierarchy of serialization of any requests to a given cache line before
allowing any node in the system to access that cache line so that cache coherence is maintained
throughout all nodes in the system which may be operating together.

The method and apparatus described utilizes global snooping to provide a single point
of serialization. Global snooping is accomplished in the invention by providing that all nodes
within the system pass all requests for data to a centralized controller which communicates with
each node in the system and maintains centralized cache state tags. The central controller,
which is a data tag and address crossbar system interconnecting the nodes, examines the cache
state tags of each line for all nodes simultaneously, and issues the appropriate reply back to a
node which is requesting data. The controller also generates other requests to other nodes for
the purpose of maintaining cache coherence and supplying the requested data if appropriate.
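
The central lookup described above can be illustrated with a short sketch. It is a minimal model under stated assumptions, not the disclosed hardware: the state names (`I`, `S`, `M`), the `TagStore` class and its methods are hypothetical and introduced only for this example.

```python
from typing import Dict, Optional

# Hypothetical per-line states; the text only says the controller keeps
# centralized cache state tags, not how they are encoded.
INVALID, SHARED, MODIFIED = "I", "S", "M"


class TagStore:
    """Central tag store: one table of line -> state per node."""

    def __init__(self, node_ids):
        self.tags: Dict[int, Dict[int, str]] = {n: {} for n in node_ids}

    def set_state(self, node: int, line: int, state: str) -> None:
        self.tags[node][line] = state

    def snoop(self, line: int) -> Dict[int, str]:
        """Return the state of `line` on every node in a single lookup."""
        return {n: t.get(line, INVALID) for n, t in self.tags.items()}

    def current_holder(self, line: int) -> Optional[int]:
        """Node holding a modified copy, or None if no node has one."""
        for node, state in self.snoop(line).items():
            if state == MODIFIED:
                return node
        return None
```

Because every request passes through the one store, the order in which `snoop` is called is the serialization order for a given line.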

The preferred embodiment divides all of the system's memory space, associated with
each node of one or more microprocessors, into local and remote categories for each node.
Each node owns a unique portion of the total memory space in the entire system, defined as
local to that node. Every part of system memory is owned by exactly one of the
collection of nodes in the system. Local and remote categories for any cache line are mutually
exclusive, whereby all cache lines in the system that are not local to a node are therefore
defined as remote to that node. The invention provides support for a third level remote cache
for each node of the system, wherein each node caches remote data lines specified by the
controller. Each memory board associated with each node or group of processors is itself
divided into local memory, with a portion of the memory being defined as remote cache. The
portion of each local memory system defined as remote cache is operating as a third level
cache. The invention provides for anticipation of a requested data line by providing a means for
the system to know in advance whether a requested line may be located in local memory for a
particular node, or the portion of the local memory which is defined as remote cache which may
be caching another node's local memory.
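
The local and remote categories can be sketched as a simple address-to-owner mapping. The layout below (four nodes, each owning one contiguous 1 GiB range) and the function names are assumptions made for illustration; the text does not specify how memory is divided among nodes.

```python
from bisect import bisect_right

# Hypothetical layout: four nodes, each owning one contiguous 1 GiB range.
NODE_BASES = [0x0000_0000, 0x4000_0000, 0x8000_0000, 0xC000_0000]
TOTAL_SIZE = 0x1_0000_0000


def home_node(address: int) -> int:
    """Return the single node that owns `address`."""
    if not 0 <= address < TOTAL_SIZE:
        raise ValueError("address outside system memory")
    return bisect_right(NODE_BASES, address) - 1


def is_local(node: int, address: int) -> bool:
    return home_node(address) == node


def is_remote(node: int, address: int) -> bool:
    # Local and remote are mutually exclusive for any given node.
    return not is_local(node, address)
```

With such a mapping a node can tell in advance whether a line would come from its local memory or, if cached at all, from its remote cache.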

The disclosure provides that each node in the system may request data either from local
memory or remote cache memory when a request for such data is made since the controller
knows in which category the line is defined and the controller can either verify or deny the use
of the anticipated data line when it has completed its coherency checks. If the subject data line
that was anticipated and read in advance is coherent, then the node requesting the data can use
that line. If the line is not coherent, then the controller will have forwarded a request to the
appropriate node, and the requesting node discards the anticipated data line and instead uses
the line returned due to the request made by the controller.
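
The verify-or-discard step for an anticipated line can be sketched as follows. The function and parameter names are hypothetical; the sketch only models the decision, not the timing of the early read.

```python
def complete_read(anticipated_line, is_coherent, wait_for_forwarded_line):
    """Return the data the requesting node should use.

    anticipated_line: data read early from local memory or remote cache.
    is_coherent: the controller's verdict once its coherency checks finish.
    wait_for_forwarded_line: callable returning the line supplied by the
        node to which the controller forwarded the request.
    """
    if is_coherent:
        return anticipated_line
    # The early read is stale: discard it and use the forwarded copy.
    return wait_for_forwarded_line()
```

For example, `complete_read(b"old", False, lambda: b"new")` returns the forwarded copy, while passing `True` returns the anticipated line without waiting.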

The use of a remote cache is provided to reduce the latency of cache lines that would
otherwise have to be obtained from another node in the system. The existence of the disclosed
remote cache provides coherency states that eliminate the need for some node to node data
transmission by allowing "dirty" cache lines to exist in nodes as read only copies. The existence
of such "dirty" read only cache lines held in the remote cache delays the time when the cache
line must be restored to its original memory location, thus enhancing the possibility that other
cache state transitions will eliminate the need to restore the cache line to its original memory
location, thus saving otherwise unnecessary system transactions which require the
expenditure of bandwidth. If such remote cache evicts such a "dirty" data line, and the system
node evicting is the nominal owner of that data line, the controller reassigns a different sharing
node as the new owner of that data line without any actual data movement within the system. If
no node shares the data other than the node evicting the "dirty" line, then the line is restored to
its memory location.
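
The ownership hand-off on eviction of a "dirty" line can be sketched as below. The function name, the tuple it returns, and the rule of picking the lowest-numbered remaining sharer as new owner are assumptions; the text says only that a different sharing node becomes the owner.

```python
def evict_dirty_line(owner, sharers, evicting_node):
    """Return (new_owner, remaining_sharers, write_back_needed).

    owner: node currently recorded as nominal owner of the dirty line.
    sharers: set of nodes holding a read only copy (owner included).
    """
    remaining = set(sharers) - {evicting_node}
    if evicting_node != owner:
        # A non-owner dropped its copy; ownership is unchanged.
        return owner, remaining, False
    if remaining:
        # Hand ownership to another sharer; no data moves.
        return min(remaining), remaining, False
    # The last copy is leaving: the line must go back to its home memory.
    return None, remaining, True
```

Only the last case generates a data transfer, which is the saving the remote cache states are meant to provide.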

In the invention, system memory is defined for a given computational node as the
aggregate memory of all nodes allocated to a partition within the system of which the given node is
a member. Therefore, in systems in which more than one partition of computational nodes
exists, the invention and system described operate on, and allocate, the system memory only
across those nodes defined as operating within a single partition within the system.

Details of the Preferred Embodiment 
Fig. 1 presents an example of a typical multiprocessor system in which the present
invention may be used. Fig. 1 illustrates a multi-processor system which utilizes four separate
central control systems (control agents) 66, each of which provides input/output interfacing and
memory control for an array 64 of four Intel brand Itanium class microprocessors 62 per
control agent 66. In many applications, control agent 66 is an application specific integrated
circuit (ASIC) which is developed for a particular system application to provide the interfacing
for each microprocessor bus 76, each memory 68 associated with a given control agent 66,
PCI interface bus 21, and PCI input/output interface 80, along with the associated PCI bus 74
which connects to various PCI devices. Bus 76 for each microprocessor is connected to
control agent 66 through bus 61. Each PCI interface bus 21 is connected to each control agent
66 through PCI interface block bus 20.

Fig. 1 also illustrates the port connection between the tag and address crossbar 70 as
well as data crossbar 72. As can be appreciated from the block diagram shown in Fig. 1,
crossbar 70 and crossbar 72 allow communications between each control agent 66, such that
addressing information and memory line and write information can be communicated across the
entire multiprocessor system 60. Such memory addressing system communicates data locations
across the system and facilitates update of control agent 66 cache information regarding data
validity and required data location.

A single node or "quad" processor group 58 is comprised of microprocessors 62,
memory 68, and control agent 66. In multiprocessor systems to which the present invention
relates, quad memory 68 is usually Random Access Memory (RAM) available to the local
control agent 66 as local or home memory. A particular memory 68 is attached to a particular
control agent 66 in the entire system 60, but is considered remote memory when accessed by
another quadrant or control agent 66 not directly connected to a particular memory 68
associated with a particular control agent 66. A microprocessor 62 existing in any one quad 58
may access memory 68 on any other quad 58. NUMA systems typically partition memory 68
into local memory and remote memory for access by other quads; the present invention
enhances the entire system's ability to keep track of data when such data may be utilized or
stored in memory 68 which is located in a quad 58 different from, and therefore remote from,
quad 58 which has a PCI device which may have issued the data.

Fig. 3 is a different view of the same multiprocessor system shown in Fig. 1 in a
simpler view, illustrating one quad 58 in relation to the other quad components as well as
crossbar system 70 and 72 illustrated for simplicity as one unit in Fig. 3. The invention
disclosed defines a certain portion of memory 68 located in each node or quad 58 as remote
cache 79. The portion of memory 68 operating as local memory acts as home memory for the
particular quad 58 with which it is associated, while remote cache 79, part of memory board 68
but defined as a remote cache, operates as a remote cache for other nodes in the system. As
can be seen in Fig. 3, remote cache 79 is different from cache 63 which is normally associated
with a particular processor 62. Cache 63 is normally on the same substrate or chip as
processor 62 and can be divided into what is often referred to as level 1 (L1) and level 2 (L2)
cache.

The tag and address crossbar 70 and data crossbar 72 allow the interfaces between
four memory control agents 66 to be interconnected as shared memory common operating
system entities, or segregated into separate instances of shared memory operating system
instances if the entire system is partitioned to allow for independently operating systems within
the system disclosed in Fig. 1. The tag and address crossbar 70 supports such an architecture
by providing the data address snoop function between the microprocessor bus 76 on different
quads 58 that are in a common operating system instance (partition). In support of the snoop
activity, the tag and address crossbar 70 routes requests and responses between the memory
control agents 66 of the quads 58 in a partition. Each partition has its own distinct group of
quads 58 and no quad can be a part of more than one partition. Quads of different partitions
do not interact with each other's memory space; the invention will be described below with the
assumption that all quads in the system are operating within a single system partition.

Tag and address crossbar 70 receives inbound requests across each crossbar 70 bus,
shown as a single instance 41 for port 1 in Fig. 1. Tag and address crossbar 70 processes
inbound data requests in either the even tag pipeline 52 or odd pipeline 53 detailed in Fig. 2,
sends a reply back on output 46 to the requesting quad 58, and sends outbound data
request(s) to other tag and address crossbar 70 output busses, 45, 47 or 48, if necessary. Tag
and address crossbar 70 allocates Transaction Identifications (TrIDs) for its outbound requests
for each destination memory control agent 66 and for an eviction to the requesting node if
necessary. The memory control agents 66 release such TrIDs for reallocation when
appropriate. The tag and address crossbar 70 reply to a memory control agent 66 request for
data does not necessarily complete the transaction. If the target address for requested data is
local to the requesting quad 58 and no remote caches hold the line as modified, then tag and
address crossbar 70 replies with GO (as shown in the tables of Figs. 4, 5 and 6), and memory
control agent 66 uses data from its local memory 68 to complete the read to the processor 62
requesting the data.

If the data is modified in another quad's remote cache, then tag and address crossbar
70 replies with WAIT, and the requesting memory control agent 66 suspends the request until
the data crossbar 72 supplies read data. When the tag and address crossbar 70 issues the
WAIT reply, it also issues a read request outbound to the target memory control agent 66 that
owns the cache line, including the quad identifier (quad ID) and TrID of the original requesting
(source) memory control agent 66. At this point, the target control agent 66 gets the target line
from its memory board and sends it to the data crossbar 72 with the source quad ID and
source TrID attached. Data crossbar 72 uses the source quad ID to route the data to the
source control agent 66 where it can be delivered to the requesting processor by observing the
source TrID value returned with the data.
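
The GO and WAIT replies and the routing of returned data can be sketched as follows. GO, WAIT, quad ID and TrID come from the text; the `ReadRequest` class, the dictionary fields and both function names are hypothetical, and only the two cases described above are modelled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ReadRequest:
    source_quad: int
    source_trid: int
    address: int


def tag_crossbar_reply(req: ReadRequest, home_quad: int,
                       modified_in: Optional[int]) -> Tuple[str, Optional[dict]]:
    """Return (reply, outbound_request) for a read.

    modified_in: quad whose remote cache holds the line modified, or None.
    """
    if modified_in is None and home_quad == req.source_quad:
        # Local line, no modified remote copy: local memory is current.
        return "GO", None
    if modified_in is not None:
        # Ask the holder to supply the line; carry the source identifiers
        # so the data can find its way back without the tag crossbar.
        outbound = {"target_quad": modified_in, "address": req.address,
                    "source_quad": req.source_quad,
                    "source_trid": req.source_trid}
        return "WAIT", outbound
    raise NotImplementedError("case not covered by this sketch")


def data_crossbar_route(packet: dict) -> int:
    """The data crossbar routes on the source quad ID carried with the data."""
    return packet["source_quad"]
```

The source TrID travels with the data so the source control agent can match the returned line to the suspended request.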

The tag and address crossbar 70 serializes all requests in system 60 into two streams of
addresses that it sends to its even and odd tag pipelines shown in Fig. 2. Fig. 2 is presented in
three parts as Fig. 2A, 2B and 2C for clarity but represents one diagram. Each pipeline sends
the addresses to the external SRAM Remote Cache Tags (RCT) to look up the global cache
state. Since each SRAM component stores RCT entries for only its own port, i.e., its own
memory control agent 66, the read of all four SRAMs constitutes a read of all four RCTs at
that index. Entries from ports that are not a member of the partition making the request are
ignored. A cache line is home to (i.e. local to or owned by) exactly one quad 58, so at least
one port should yield a tag logic miss (tag mismatch or state of I). The order of the information
within a pipeline is always preserved.

When tag logic 57 determines that a RCT entry must be modified (due to state, tag, or
ECC fields), the tag and address crossbar 70 schedules a write to the external SRAMs through
the write buffer located in tag comparator and dispatcher 84 and 85. To prevent conflicts
where a new access could be looked up while a write is pending in the write buffer, the write
buffer entries are snooped. A snoop hit (a valid address match) in the write buffer causes the
lookup to stall, and the write buffer contents are streamed out to the SRAMs. Lookups that
are progressing through the tag pipeline may eventually result in tag writes, so they must be
snooped as well. Snoop hits to these entries also cause a stall until the conflict is resolved
(including draining the write buffer, if necessary).
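
The write buffer snoop can be modelled in a few lines. This is a software sketch of the hazard check only: the `TagWriteBuffer` class is hypothetical, the SRAM is stood in for by a dictionary, and a stall is represented by draining the buffer before the lookup proceeds.

```python
from collections import deque


class TagWriteBuffer:
    """Pending tag writes that have not yet reached the SRAM."""

    def __init__(self, sram: dict):
        self.sram = sram
        self.pending = deque()  # (index, entry) pairs in arrival order

    def schedule_write(self, index: int, entry) -> None:
        self.pending.append((index, entry))

    def snoop_hit(self, index: int) -> bool:
        """True if a pending write targets the index being looked up."""
        return any(i == index for i, _ in self.pending)

    def drain(self) -> None:
        """Stream all pending writes out to the SRAM, oldest first."""
        while self.pending:
            index, entry = self.pending.popleft()
            self.sram[index] = entry

    def lookup(self, index: int):
        """Look up a tag; a snoop hit drains the buffer before reading."""
        if self.snoop_hit(index):
            self.drain()
        return self.sram.get(index)
```

A lookup that misses in the buffer reads the SRAM directly and leaves pending writes untouched.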

The tag and address crossbar 70 manages the direct mapped Remote Caches by
allocating the entries and maintaining their system state. If a request requires a RCT entry and
that entry already is being used, the old entry must be evicted, this operation sometimes being
referred to as a rollout in the art. Tag and address crossbar 70 does this by issuing an invalidate or
read-invalidate to the control agent 66 that made the request that caused the eviction (also
called the instigator). This rollout request that tag and address crossbar 70 makes is in addition
to and prior to the original (instigator) request being sent to the target memory control agent
66.
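
A rollout in a direct mapped tag table can be sketched as follows. The table size, the modulo split of the line address into index and tag, and the function names are assumptions chosen to keep the example small.

```python
NUM_ENTRIES = 8  # hypothetical size; a real table would be far larger


def split_address(line_address: int):
    """Direct mapped: the low part picks the entry, the rest is the tag."""
    return line_address % NUM_ENTRIES, line_address // NUM_ENTRIES


def allocate(rct: dict, line_address: int):
    """Allocate an RCT entry; return the rolled-out line address or None."""
    index, tag = split_address(line_address)
    old_tag = rct.get(index)
    rct[index] = tag
    if old_tag is not None and old_tag != tag:
        # Rollout: the old line must be invalidated at the instigator
        # before the original request is sent to its target.
        return old_tag * NUM_ENTRIES + index
    return None
```

Two line addresses that share an index cannot both be tracked, so allocating the second forces the first out.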

As a system is cx)nfigured with virtiMy identic 
partitioned as a single system or up to four separate partitioned systems using the method 
5 disclosed In the preferred embodiment, the maximum total number of quads 58 is four, as 
configured in Fig, 1. Every port oftag and address crossbar 70 is assigned to one of the four 
control agent 66 by virtue of its physical connection between agent 66 and crossbar 70. 
hiterconnections between tag and address crossbar 70 and data crossbar 72 to each of control 
agents 66 are accomplished through bus 71. Shown in Fig. 1 as a connection from tag and
address crossbar 70 and data crossbar 72 to the control agent 66 in quad one, the bus is also
referred to as a port. Though shown only at quad one, the configuration of bus 71 is duplicated
for each quad 58, as can be appreciated by the connections for ports 0, 1, 2 and 3 shown in
Fig. 1. Bus 73 is the portion of bus 71 that connects control agent 66 to tag and address
crossbar 70. Bus 75 is the portion of bus 71 which connects the data crossbar 72 to each
control agent 66. Each of the quads of the system demonstrated in Fig. 1 communicates with the
remaining portions of the system through tag and address crossbar 70 as well as data crossbar
72 through channels defined as ports. Ports 0, 1, 2 and 3 are all shown on Fig. 1
interconnecting the crossbar systems with the control agent 66 through input and output
portions of each port, interconnecting each crossbar to each given quad. All of the quads 58 in
Fig. 1 are connected in a similar fashion, as can be appreciated from the figure, utilizing
interconnect bus 71 as shown in port 1 of Fig. 1. The crossbar system, including the ports
interconnecting the crossbars with each of the quads 58, is essentially a communication pathway
connecting the processing nodes. Fig. 2 illustrates internal logic of tag and address
crossbar 70 shown in Fig. 1. Input 40 for port 0, input 41 for port 1, input 42 for port 2, and
input 43 for port 3 illustrate part of the communications pathway from each control agent 66 in
each quad or node into tag and address crossbar 70. Likewise, Fig. 2 illustrates port 0 output 45,
port 1 output 46, port 2 output 47 and port 3 output 48, also illustrated on the entire system
block diagram shown in Fig. 1. Tag look-up registers which function with tag and address
crossbar 70 are shown at 81(a) and 81(b). Registers 81(a) and 81(b) are identical except that
they are associated with an even pipeline and odd pipeline for tag processing as shown in
Fig. 2. The dual pipeline design is provided to reduce latency in the system by directing
processing of even numbered tags to the even pipeline and odd numbered tags to the odd
pipeline so that simultaneous processing may occur.
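By way of illustration only, the steering of requests to the even and odd pipelines can be sketched as follows. The sketch assumes that "even" and "odd" refer to the parity of the cache line number and uses the 32-byte line size given later in this description; the function names `pipeline_for` and `steer` are hypothetical and do not appear in the figures.

```python
LINE_BYTES = 32  # cache line size stated for the preferred embodiment


def pipeline_for(address: int) -> str:
    """Return the tag pipeline assumed to handle the line containing `address`.

    Assumption: parity is taken from the cache line number, not the byte address.
    """
    line_number = address // LINE_BYTES
    return "even" if line_number % 2 == 0 else "odd"


def steer(addresses):
    """Split a stream of request addresses into per-pipeline queues, preserving order."""
    queues = {"even": [], "odd": []}
    for address in addresses:
        queues[pipeline_for(address)].append(address)
    return queues
```

Because the two queues are independent, a request on an even line never waits behind a request on an odd line, which is the latency benefit the dual pipeline is described as providing.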

Inputs 40, 41, 42 and 43 are each introduced through a buffer and are operatively
connected to both even input multiplexor 50 and odd input multiplexor 51, the appropriate
multiplexor (mux) being selected in accordance with the even or odd relationship with the input
tag. Each multiplexor 50 and 51 serves to serialize the flow of tags from the four inputs. The
outputs of multiplexors 50 and 51 are sent to another multiplexor to be sent ultimately to tag
look-up registers 81(a) and 81(b). Even pipeline logic 52 and odd pipeline logic 53 evaluate
the tags being presented and the request type to generate an output response and requests for
ports that are connected to a defined quad within its partition. The resulting output entries are
buffered in the dispatch buffers 54 and 55, each of which is a first in, first out (FIFO) buffer.
Dispatch buffers 54 and 55 decouple timing variances between the tag logic shown and the
output selection logic. Entries are stored in dispatch buffers 54 and 55 in first in, first out order
until they can be sent to the destination ports, being output 45, 46, 47 or 48, representing one
output to each port or quad.

Tag look-up registers 81(a) and 81(b), identical in configuration, are each made up of
Synchronous Static Random Access Memory (SSRAM) chips, a total of four, each 512 kbits
by 16 bits. Tag look-up register 81(a) is connected through line 82(a) to even tag comparator
and dispatcher 84. Though shown as one connection in Fig. 2, connection 82(a) is actually
four paths, each corresponding to inputs 0, 1, 2 and 3 from each port as described. Register
81(b), connected to the odd tag comparator and dispatcher 85 through connection 82(b), is
essentially identical in function. Path 82(b) is likewise comprised of four paths, each
corresponding to a port. Tag look-up registers 81(a) and 81(b) are external memory which
interfaces with tag and address crossbar 70 and is used to store the tag and state information
for all of the remote cache tags in the entire system. Such information is not directly accessible
by memory control agent 66, so all cacheable transactions generated in control agent 66 must
access crossbar 70 to access or "snoop" crossbar 70's remote cache tags (RCTs). The
physical configuration of registers 81(a) and 81(b) is illustrated in the block diagram shown in
Fig. 10. As shown in Fig. 10, registers 81(a) and 81(b) are implemented with synchronous static
random access memory chips (SSRAM) which operate at the internal clock frequency of
crossbar 70, being 133 MHz in the present invention. As can be seen also in Fig. 10, there are
two groups of external SSRAMs, the groups being divided to odd and even pipelines as shown
on Fig. 2. Each group of registers 81(a), 81(b) is split into four "quadrants", with each
quadrant representing a physical port of crossbar 70. As there are a total of four ports in the
preferred embodiment as shown in the system diagram of Fig. 1, it can be appreciated that
each port corresponds to a physical quad in the present invention, as earlier described.
Therefore, each port of the RCT interface represents the RCTs for a physical quad's remote
cache as is illustrated in Fig. 10, and each quadrant of the tag look-up registers 81(a) and 81(b)
contains the tag and state information.

Turning now to the remote cache, the remote cache states, displayed in Table 1 below,
shall be described in accordance with the operation of the invention in the preferred
embodiment. Tag and address crossbar 70 maintains direct-mapped cache tags for the
remote cache for remote addresses. The tag entries have an address tag portion and a state
portion (there are also 6 check bits for SEC/DED protection). The possible remote cache state
values are: I, S, or M (for invalid, shared, and modified). Each port (port 0, 1, 2 and 3 as
shown in Fig. 1) on the tag and address crossbar 70 has a corresponding even and odd tag
SRAM array for these cache tags. For ports that share the same partition ID, the
corresponding cache tag quadrants form a collective remote cache tag state. Tag quadrants of
two different partitions (if there is at least one node operating in a separately defined partition)
have no impact on each other except for the physical sharing of SRAM address pins (which
forces accesses to be serialized). The collective tag state for an address is the state of all
quadrants at that index in the requester's partition whose tag address matches the address.

As described above, the possible collective states used in the present invention are:
invalid, shared, dirty, and modified. For these collective states:

1. invalid means all quads in the partition have either an I state or a tag mismatch at that
index;

2. shared means that at least one quad matches its tag and has an S state at the index
(but none matches and has an M state);

3. dirty means that exactly one quad matches with an M state and at least one matches
with an S state; and

4. modified means that exactly one quad matches with an M state and all other quads
are invalid.
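The four collective states defined above can be expressed as a short sketch. It is illustrative only: `TagEntry` and `collective_state` are hypothetical names, and `entries` is assumed to hold the tag entries read at one index from the quadrants belonging to the requester's partition.

```python
from typing import NamedTuple, Sequence


class TagEntry(NamedTuple):
    tag: int    # address tag stored at this index
    state: str  # "I", "S" or "M"


def collective_state(entries: Sequence[TagEntry], request_tag: int) -> str:
    """Combine the per-quad tag entries at one index into a collective state."""
    # A quad contributes only if its tag matches and its state is not I.
    matching = [e.state for e in entries if e.tag == request_tag and e.state != "I"]
    modified = matching.count("M")
    shared = matching.count("S")
    if modified > 1:
        raise ValueError("more than one quad holds the line in the M state")
    if modified == 1:
        return "dirty" if shared else "modified"
    return "shared" if shared else "invalid"
```

A tag mismatch and an I state are treated identically, which mirrors the definition of the invalid collective state.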

The dirty state implies that memory at the home quad 58 is stale, that all quads that
hold it as shared (S) or modified (M) have an identical copy, and that no processor 62 has a
modified copy in its internal cache. The tag and address crossbar 70 performs a read access
to all four tag quads of the even/odd tag array whenever the even/odd pipeline processes an
inbound request for an even/odd memory address. Processing of the request and the resultant
lookup may require an update to the tags. The cache line is protected against subsequent
accesses to the cache line while a potential update is pending. Memory-Mapped Input/Output
(MMIO) addresses and requests to non-memory targets do not require a lookup, but still
consume a pipeline stage.

Tag and address crossbar 70 throttles inbound requests using the credit/release
mechanism. Control agent 66 assumes that a "credit" number of requests can be sent and will
not allow further requests when the credits are expended. Crossbar 70 returns the credit with a
credit release, which allows the credit to be re-used.
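The credit/release mechanism may be pictured with the following minimal sketch, written from the point of view of a control agent. The class name `CreditGate` and its methods are hypothetical; the actual credit count and signalling are not specified in this passage.

```python
class CreditGate:
    """Tracks how many requests a control agent may still send to the crossbar."""

    def __init__(self, credits: int):
        if credits < 0:
            raise ValueError("credit count must be non-negative")
        self.limit = credits      # assumed fixed "credit" number of requests
        self.available = credits  # credits not currently in use

    def try_send(self) -> bool:
        """Consume one credit if any remain; return False when credits are expended."""
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def release(self) -> None:
        """Model a credit release returned by the crossbar."""
        if self.available >= self.limit:
            raise RuntimeError("release without a matching outstanding request")
        self.available += 1
```

For example, with two credits a third request is held back until one release arrives, after which exactly one further request may be sent.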

Address conflicts in the write buffer or the tag pipelines of the tag and address crossbar
70 can stall progress in the tag pipelines until the conflict is resolved. A lack of TrIDs may
delay movement of a pipeline entry (token) from the tag logic into the dispatch buffer 54. If the
dispatch buffer 54 or 55 is full, a new entry cannot be entered into the buffer. There are also
certain errors (such as SEC correction) that stall the pipeline for one clock. For these reasons,
a pipeline entry could be delayed at the input to the dispatch buffer 54 or 55 and thus cause the
pipeline to become blocked. In the case of insufficient TrIDs, the entry is delayed for a period
of time (programmable) to wait for a TrID to become available. If the delay period expires, the
entry will be retried instead. In this event, the token is converted to a retry and placed into the
dispatch buffer 54 or 55 as a response to the requester (errors are treated in a similar manner:
replacement of the original token with an error token). Conversion to a retry or error allows
the entry to be placed into the dispatch buffer 54 or 55, and the pipeline can then advance.
Such a situation cancels any external tag updates that may have been scheduled or TrIDs that
may have been allocated. If an entry is delayed at the input to the dispatch buffer 54 or 55,
any external tag read operations need to be queued when they return to crossbar 70. These
queued read results (called the stall collectors) must be inserted back into the pipeline in the
correct order when the dispatch buffer 54 or 55 becomes unblocked.
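The handling of an entry that lacks a TrID can be sketched as below. This is a simplified model under stated assumptions: time is counted in whole clocks, `trid_available` is a hypothetical callable reporting whether a TrID is free at a given clock, and `timeout_clocks` stands in for the programmable delay period. Cancellation of scheduled tag updates and the stall collectors are not modeled.

```python
from typing import Callable, Tuple


def wait_for_trid(trid_available: Callable[[int], bool],
                  timeout_clocks: int) -> Tuple[str, int]:
    """Decide what is placed into the dispatch buffer for a delayed entry.

    Returns ("request", clocks_waited) if a TrID became available within the
    delay period, otherwise ("retry", timeout_clocks) to indicate that the
    token was converted to a retry response for the requester.
    """
    for clock in range(timeout_clocks + 1):
        if trid_available(clock):
            return ("request", clock)
    return ("retry", timeout_clocks)
```

In either outcome something is placed into the dispatch buffer, so the pipeline can advance rather than remain blocked behind the delayed entry.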

Inbound requests are throttled only by the credit/release mechanism. Inbound
responses are not defined. Outbound requests are throttled by the availability of a TrID in the
target control agent 66 (as determined in the tag and address crossbar 70's TrID allocation
logic). Outbound responses have no throttling mechanism and the control agent 66 accepts any
and all crossbar 70 responses. All outbound responses or requests in an entry in dispatch
buffer 54 or 55 must be referred to their respective outputs simultaneously. Therefore, dispatch
buffers 54 and 55 must compete for availability of output 45, 46, 47 and 48 if they conflict.
Furthermore, an output response to an output port may be accompanied by a rollout (eviction)
request, and both must be sent to that output port, rollout first.

Tag and address crossbar 70 receives requests from the control agent 66 and reacts
based on the state of the cache line in each quad's remote cache tag. The aggregate of the
quad remote cache (RC) states is called the global remote cache (RC) state. In Table 1, the
combinations of legal state values are abbreviated as 3 characters of I, S, and M as described
earlier. The second column is the state of the requester's RC tag, the third column is the state
of other quads except for the owner, and the fourth column is the state of the owning quad.
The global RC state has the state name given in column 1 of table 1. Local requests should
always mismatch or have an I state. It should be appreciated that local requests imply that the
requester is in state I since local addresses are not cached in the remote cache. In rows where
the Req state is I, the line is not present in that quad's remote cache. The I state means no tag
match occurred, the state was I, or that the port is not a member of the partition making the
request. All four of the Dirty states imply that the processors in the quad holding the line in the
M state have unmodified data in their on-chip (L1/L2) caches.



Table 1

Global Remote Cache State Definitions

State Name   Req  Sharer  Owner  Comment                                             Memory
Invalid      I    I       I      line is Home                                        Clean
SharedMiss   I    S       I      line is clean shared, but misses its own RC         Clean
SharedHit    S    I       I      line is clean hit, but exclusive to requester's RC  Clean
SharedBoth   S    S       I      full sharing                                        Clean
DirtyMiss    I    S       M      line is modified in another RC, but only shared     Stale
                                 in processor caches
DirtyHit     S    I       M      requester already caching, no 3rd party sharers     Stale
DirtyBoth    S    S       M      full sharing and modified in owner's RC             Stale
DirtyMod     M    S       I      requester is also owner                             Stale
ModMiss      I    I       M      exclusively owned by another quad                   Stale
ModHit       M    I       I      requester is also owner                             Stale



When the requester is the owner, table 1 assigns Req the value M and shows the owner
as state I. Such state names are used in the figures to demonstrate how tag and address
crossbar 70 reacts to bus 73 requests. It should be also appreciated that bus 73 is illustrative
of each bus of tag and address crossbar 70 for each port shown in Fig. 1.
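As an illustration of how the state names of Table 1 may be derived, the following sketch maps the requester's RC state and the RC states of the other quads onto a global RC state name. The function name `global_rc_state` and the dictionary layout are hypothetical; the entries follow the Req, Sharer and Owner columns of Table 1, including the convention that a requester that is also the owner is recorded with Req = M and Owner = I.

```python
from typing import Sequence

# (Req, Sharer, Owner) -> (state name, condition of home memory), per Table 1
GLOBAL_RC_STATES = {
    ("I", "I", "I"): ("Invalid", "Clean"),
    ("I", "S", "I"): ("SharedMiss", "Clean"),
    ("S", "I", "I"): ("SharedHit", "Clean"),
    ("S", "S", "I"): ("SharedBoth", "Clean"),
    ("I", "S", "M"): ("DirtyMiss", "Stale"),
    ("S", "I", "M"): ("DirtyHit", "Stale"),
    ("S", "S", "M"): ("DirtyBoth", "Stale"),
    ("M", "S", "I"): ("DirtyMod", "Stale"),
    ("I", "I", "M"): ("ModMiss", "Stale"),
    ("M", "I", "I"): ("ModHit", "Stale"),
}


def global_rc_state(requester: str, others: Sequence[str]) -> str:
    """Name the global RC state from the requester's and the other quads' states."""
    sharer = "S" if "S" in others else "I"  # any third-party sharer
    owner = "M" if "M" in others else "I"   # an owner other than the requester
    key = (requester, sharer, owner)
    if key not in GLOBAL_RC_STATES:
        raise ValueError(f"illegal state combination: {key}")
    return GLOBAL_RC_STATES[key][0]
```

Combinations absent from Table 1, such as two quads in the M state, are rejected rather than named.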

Fig. 4 illustrates cached reads and read-invalidates. Fig. 4 is presented in four parts as
Fig. 4A, 4B, 4C and 4D for clarity but represents one diagram. Fig. 5 illustrates uncached
reads. Fig. 5 is presented in two parts as Fig. 5A and 5B for clarity but represents one
diagram. Fig. 6 illustrates uncached writes. Fig. 6 is presented in two parts as Fig. 6A and 6B
for clarity but represents one diagram. Fig. 7 illustrates reads and writes to MMIO, CSRs, and
non-memory targets. Fig. 8 illustrates rollout requests. Fig. 8 is presented in two parts as Fig.
8A and 8B for clarity but represents one diagram. All of the mnemonics for the fields used in
all the input and output buses for crossbar 70 (shown in one instance as bus 73 in Fig. 1) are
illustrated in Fig. 9 and may be used as references in the review of the figures which use
mnemonics in the illustration of the operation of the preferred embodiment. Fig. 9 is presented
in three parts as Fig. 9A, 9B, and 9C for clarity but represents one diagram. Reference will
now be made to such figures as the operation of the preferred embodiment will be illustrated.

Figs. 4, 5, 6, 7, 8 and 9 illustrate the operation of the preferred embodiment and can
be used to understand the actual operation of the method and the system disclosed herein.
Considering Fig. 4, the table illustrated describes various states of cached reads and read
invalidates which fully explain the implementation of the present invention. Fig. 4 may be used
to illustrate any data requests initiated by any of the four ports, and used as an example to
define the results and the response to all other ports for a given input. Column 101 of Fig. 4
contains the various types of bus requests. Column 101 includes a request for a cached read to
a local address (LCR), a request for a cached read to a remote address (RCR), a request
for a cached read-invalidate to a local address (LCRI), and a request for a cached
read-invalidate to a remote address (RCRI). An illustrative example will be used to demonstrate
the operation of the invention assuming that the bus 73 request in column 101 is originating in
port 1, thereby relating to input 41 as shown in Fig. 1.

For the purpose of the present example, column 101 represents input 41 on Fig. 2. In
such a case, for each global remote cache state in column 102 the tag and address crossbar 70
response to such a request is given in column 103, and such response is directed to the output
port to which the request was made. In the present example, a request on input 41 corresponds
to output 46 handling the response to the request. Likewise, columns 104, 105 and 106 refer
to output 45, 46, 47, or 48 in Fig. 2, as they relate to home, sharer or owner. In the present
example, the request to home in column 104 is to one of the outputs other than 46. A
request to sharers in column 105 necessarily excludes the request to home quads in column
104. Home specified in column 104 is the quad where an address is local, or means an
address is local to the particular quad. Column 106 depicts the request
to the quad which is defined as the owner of the data in question. In the example used for a
request received from quad 1, column 106 is by definition to one of the outputs other than
output 46, which is associated with port 1. Column 107 defines the next global remote cache
state and column 108 defines whether the remote cache allocate and/or rollout should be
associated with the particular bus request in column 101. A yes in column 108 means a
relocation may be necessary, which may further mean that a rollout of an existing line is
required.
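The relationship between a remote cache allocate and a rollout can be sketched for a direct-mapped tag array as follows. The sketch is illustrative only: the class `RemoteCacheTags`, the split of a line number into index and tag, and the `num_entries` parameter are assumptions, and the requests actually sent for a rollout are those defined in the figures rather than anything modeled here.

```python
LINE_BYTES = 32  # cache line size of the preferred embodiment


class RemoteCacheTags:
    """Direct-mapped tag array for one quad's remote cache (simplified)."""

    def __init__(self, num_entries: int):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # each slot: (tag, state) or None

    def _split(self, address: int):
        line = address // LINE_BYTES
        return line % self.num_entries, line // self.num_entries  # (index, tag)

    def allocate(self, address: int, state: str):
        """Install the line; return the address of a rolled-out line, or None."""
        index, tag = self._split(address)
        victim = None
        old = self.entries[index]
        # A valid entry for a different line at the same index must be evicted.
        if old is not None and old[1] != "I" and old[0] != tag:
            victim = (old[0] * self.num_entries + index) * LINE_BYTES
        self.entries[index] = (tag, state)
        return victim
```

The rolled-out address differs from the requested address but shares its index, which is why a rollout concerns a different memory location held in the same cache entry.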

Fig. 5 is a table illustrating uncached read requests with similar vertical column
definitions as those depicted in Fig. 4. Fig. 5 provides the various states and responses for the
defined quads for each request for a local uncached read and for requests for a remote
uncached read. Fig. 6, once again with similar column headings, provides defined solutions for
a local uncached partial write request to crossbar 70, a remote uncached partial write
request, a local uncached full line write request, or a remote uncached full line write. In the
preferred embodiment, a full cache line is defined as 32 bytes, being the unit of coherency
used in the embodiment. Partial lines are requests made in the system which
involve less than 32 bytes and are, accordingly, considered partial lines.

Fig. 7 is a table illustrating reads and writes to memory mapped I/O locations, CSRs
and non-memory targets, the definitions for which are contained in the mnemonic descriptions
contained in Fig. 9. Fig. 8 is a table that defines the requests that must be sent to the home,
sharer, and owner nodes to accomplish an eviction of a cache line. If column 108 of Fig. 4
shows a YES for the (instigating) request and there is a valid pre-existing entry in the remote
cache, then that entry must be evicted in order to make room for caching of the (instigating)
request. Thus, Fig. 8 defines activity associated with a different memory location (but the same
cache entry), and occurs as a side effect of the (instigating) request. Fig. 9 contains the
reference mnemonics used in the Figures.

In Figs. 4 through 8, reference is made occasionally to n, where n is an integer equal to
the number of operations or requests that might be required for a particular operation. An
example using Fig. 4 will be provided which illustrates its use.

Taking line 109 on Fig. 4 as an example, in a request for a cached read-invalidate to a
local address, column 102 provides for the instance of a shared miss. Column 103 is provided
the term GO Inv=n, column 104 providing a blank and column 105 providing the definition of a
request to sharers as being n*RCI. In this example, a processor has made a request for a local
address, the data of which it intends to modify. To maintain coherency, it requests data and
wants all other copies of such data in the system to be invalidated because it plans to "own"
such data. However, in the example utilizing a shared miss, there are other quads in the
system that have a cached read only copy in addition to the copy in the requestor's local
memory. For this reason the processor can go straight to memory for such data because the
only other copies are shared copies. Therefore, what is in memory is not stale by definition. In
the example, the processor reads the data as suggested in column 103 where the response to
the requestor is defined as GO, meaning that the processor may continue the process of having
the data looked up because the anticipated data to be read is valid and that processor is clear
to use the data.

In continuing the operation defined in the present invention, the remote quads which
were earlier sharing the data must be informed that the data subject to the present request is
no longer valid and such quads no longer have valid copies of same. Accordingly, the invention
provides that such memory is defined now as stale because of the operation. Column 105
provides that n copies of an RCI (which means a remote cache line invalidate) are sent to other
sharers. Each quad (a total of n quads) earlier sharing said data is now informed that the cache
line operated upon is now invalid. In the example at line 109, column 103 indicates the number
of invalidate acknowledgments to be expected in the system (inv=n). Accordingly, inv=n
matches the n*RCI. The system operation defined in the example is not complete until the data
is received and n invalidate acknowledgments are returned. In the example, it can be appreciated
that there is only one response to the requesting quad since there is only one requesting quad.
In line 109, there can only be one request to the home quad in that there is only one home quad,
as well as one request to the owner quad because there is only one quad defined as the owner.
In column 105 it can be appreciated in the example given and in the preferred embodiment
disclosed that there can be up to three sharing quads since a total of four quads is provided in
the embodiment.
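The example of line 109 can be summarized in a brief sketch. It is a simplified model and not the content of Fig. 4: the class `Line109Transaction` and its fields are hypothetical, the sharers are identified by port number for convenience, and completion is assumed to require both the data and n invalidate acknowledgments.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Line109Transaction:
    """Local cached read-invalidate (LCRI) that finds the line in SharedMiss."""

    sharers: Tuple[int, ...]     # ports of the quads currently sharing the line
    data_received: bool = False  # assumed: set when the requester has its data
    acks_received: int = 0       # invalidate acknowledgments returned so far

    def response_to_requester(self) -> str:
        # Column 103: GO, with the number of acknowledgments to expect.
        return f"GO Inv={len(self.sharers)}"

    def requests_to_sharers(self):
        # Column 105: n*RCI, one remote cache line invalidate per sharer.
        return [(port, "RCI") for port in self.sharers]

    def is_complete(self) -> bool:
        return self.data_received and self.acks_received == len(self.sharers)
```

With sharers on ports 0 and 2 and the request arriving on port 1, the response is "GO Inv=2", two RCI requests are issued, and the operation completes only after the data and both acknowledgments have arrived. In the four-quad embodiment n can be at most three.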

With the illustrative example it can be appreciated that Figs. 4, 5, 6, 7 and 8 provide a
complete operating scheme and control protocol which precisely defines the operation of the
apparatus and method defined herein. The Q/A column of Fig. 9 indicates whether the
mnemonic is associated with a request (Q=request) or a reply (A=answer).

The present invention can be employed in any multiprocessor system that utilizes a
central control device or system to communicate between a group of microprocessors. The
invention is most beneficial when used in conjunction with a tag and address crossbar system
along with a data crossbar system which attaches multiple groups of processors employing
non-uniform memory access or distributed memory across the system. The preferred embodiment
systems and method, which allow maintaining cache coherency in a multinode system through
tracking data states within the data tag and address crossbar controller in such systems as
shown and described in detail, are fully capable of obtaining the objectives of the invention.
However, it should be understood that the described embodiment is merely an example of the
present invention, and as such, is representative of subject matter which is broadly
contemplated by the present invention.

For example, the preferred embodiment is described above in the context of a
particular system which utilizes sixteen microprocessors, comprised of quads of four separate
groups of four processors, with each quad having a memory control agent which interfaces with
the central controller crossbar, having memory boards allocated to the quad and for which the
preferred embodiment functions to communicate through other subsystems to like controllers in
the other quads. Nevertheless, the present invention may be used with any system having
multiple processors, whether grouped into "nodes" or not, with separate memory control
agents assigned to control each separate group of one or more processors when such group of
processors requires coherence or coordination in handling data read or write commands or
transaction requests, within a multi-node system.

The present invention is not necessarily limited to the specific numbers of processors or
the array of processors disclosed, but may be used in similar system design using
interconnected memory control systems with tag and address and data communication systems
between the nodes to implement the present invention. Accordingly, the scope of the present
invention fully encompasses other embodiments which may become apparent to those skilled in
the art.